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Abstract 

Given  n  observations  of  a  p-dimensional  random  vector,  the  covariance  matrix  and  its  inverse 
(precision  matrix)  are  needed  in  a  wide  range  of  applications.  Sample  covariance  (e.g.  its 
eigenstructure)  can  misbehave  when  p  is  comparable  to  the  sample  size  n.  Regularization  is 
often  used  to  mitigate  the  problem. 

In  this  paper,  we  proposed  an  £i  penalized  pseudo-likelihood  estimate  for  the  inverse  covari¬ 
ance  matrix.  This  estimate  is  sparse  due  to  the  £i  penalty,  and  we  term  this  method  SPLICE. 
Its  regularization  path  can  be  computed  via  an  algorithm  based  on  the  homotopy/LARS-Lasso 
algorithm.  Simulation  studies  are  carried  out  for  various  inverse  covariance  structures  for  p  =  15 
and  n  =  20, 1000.  We  compare  SPLICE  with  the  £i  penalized  likelihood  estimate  and  a  £\  pe¬ 
nalized  Cholesky  decomposition  based  method.  SPLICE  gives  the  best  overall  performance  in 
terms  of  three  metrics  on  the  precision  matrix  and  ROC  curve  for  model  selection.  Moreover, 
our  simulation  results  demonstrate  that  the  SPLICE  estimates  are  positive-definite  for  most  of 
the  regularization  path  even  though  the  restriction  is  not  enforced. 
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1  Introduction 


Covariance  matrices  are  perhaps  the  simplest  statistical  measure  of  association  between  a  set  of 
variables  and  widely  used.  Still,  the  estimation  of  covariance  matrices  is  extremely  data  hungry, 
as  the  number  of  fitted  parameters  grows  rapidly  with  the  number  of  observed  variables  p.  Global 
properties  of  the  estimated  covariance  matrix,  such  as  its  eigenstructure,  are  often  used  (e.g.  Prin¬ 


cipal  Component  Analysis,  Jolliffe,  2002).  Such  global  parameters  may  fail  to  be  consistently 
estimated  when  the  number  of  variables  p  is  non-negligible  in  the  comparison  to  the  sample  size 
n.  As  one  example,  it  is  a  well-known  fact  that  the  eigenvalues  and  eigenvectors  of  an  estimated 


covariance  matrix  are  inconsistent  when  the  ratio  -  does  not  vanish  asymptotically  (Marchenko 


and  Pastur,  1967)  Paul  et  al.  2008).  Data  sets  with  a  large  number  of  observed  variables  p  and 


small  number  of  observations  n  are  now  a  common  occurrence  in  statistics.  Modeling  such  data  sets 
creates  a  need  for  regularization  procedures  capable  of  imposing  sensible  structure  on  the  estimated 
covariance  matrix  while  being  computationally  efficient. 

Many  alternative  approaches  exist  for  improving  the  properties  of  covariance  matrix  estimates. 


Shrinkage  methods  for  covariance  estimation  were  first  considered  in  Stein  ( 1975 ,  1986 )  as  a  way  to 


correct  the  overdispersion  of  the  eigenvalues  of  estimates  of  large  covariance  matrices.  Ledoit  and 


Wolf  (2004)  present  a  shrinkage  estimator  that  is  the  asymptotically  optimal  convex  linear  com¬ 


bination  of  the  sample  covariance  matrix  and  the  identity  matrix  with  respect  to  the  Froebenius 
norm. 


Daniels  and  Kass  ( 1999 ,  2001 1  propose  alternative  strategies  using  shrinkage  toward  diag¬ 


onal  and  more  general  matrices.  Faetorial  models  have  also  been  used  as  a  strategy  to  regularize 


estimates  of  covariance  matrices  (Fan  et  al.  2006).  Tapering  the  covariance  matrix  is  frequently 
used  in  time  series  and  spatial  models  and  have  been  used  recently  to  improve  the  performance  of 


covariance  matrix  estimates  used  by  classifiers  based  on  linear  discriminant  analysis  (Bickel  and 


Levina,  2004)  and  in  Kalman  filter  ensembles  (Furrer  and  Bengtsson,  2007).  Regularization  of  the 


covariance  matrix  can  also  be  achieved  by  regularizing  its  eigenveetors  (Johnstone  and  Lu  2004) 


Zou  et  al.  2006). 


Covarianee  seleetion  methods  for  estimating  covariance  matrices  consist  of  imposing  sparsity 
on  the  precision  matrix  (i.e.,  the  inverse  of  the  covariance  matrix).  The  Sparse  Pseudo-Likelihood 
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Inverse  Covariance  Estimates  (SPLICE)  proposed  in  this  paper  fall  into  this  category.  This  family  of 
methods  was  introduced  by  Dempster  (|1972 1 .  An  advantage  of  imposing  structure  on  the  precision 


matrix  stems  from  its  close  connections  to  linear  regression.  For  instance,  Wu  and  Pourahmadi 


(20031  use,  for  a  fixed  order  of  the  random  vector,  a  parametrization  of  the  precision  matrix  C 
in  terms  of  a  decomposition  C  =  U^DU  with  U  upper-triangular  with  unit  diagonal  and  D  a 
diagonal  matrix.  The  parameters  U  and  D  are  then  estimated  through  p  linear  regressions  and 


Akaike’s  Information  Criterion  (AIC,  Akaike  19731  is  used  to  promote  sparsity  in  U.  A  similar 


covariance  selection  method  is  presented  in  Bilmes  (2000|.  More  recently,  Bickel  and  Levina  (20081 
have  obtained  conditions  ensuring  consistency  in  the  operator  norm  (spectral  norm)  for  precision 
matrix  estimates  based  on  banded  Cholesky  factors.  Two  disadvantages  of  imposing  the  sparsity  in 
the  factor  U  are:  sparsity  in  U  does  not  necessarily  translates  into  sparsity  of  C  and;  the  sparsity 
structure  in  U  is  sensitive  to  the  order  of  the  random  variables  within  the  random  vector.  The 
SPLICE  estimates  proposed  in  this  paper  constitute  an  attempt  at  tackling  these  issues. 


The  AIC  selection  criterion  used  in  Wu  and  Pourahmadi  (20031  requires,  in  its  exact  form, 
that  the  estimates  be  computed  for  all  subsets  of  the  parameters  in  U.  A  more  computationally 
tractable  alternative  for  performing  parameter  selection  consists  in  penalizing  parameter  estimates 


by  their  £i-norm  (Breiman  1995;  Tibshirani  1996  Chen  et  al.  2001),  popularly  known  as  the 
LASSO  in  the  context  of  least  squares  linear  regression.  The  computational  advantage  of  the  ii- 
penalization  over  penalization  by  the  dimension  of  the  parameter  being  fitted  (£o-iiorm)  -  such 


as  in  AIC  -  stems  from  its  convexity  (Boyd  and  Vandenberghe  2004).  Homotopy  algorithms  for 
tracing  the  entire  LASSO  regularization  path  have  recently  become  available  ( Osborne  et  al.  2000| 


Efron  et  al.  2004).  Given  the  high-dimensionality  of  modern  days  data  sets,  it  is  no  surprise  that 


£i-penalization  has  found  its  way  into  the  covariance  selection  literature. 


Huang  et  al.  (2006)  propose  a  covariance  selection  estimate  corresponding  to  an  £i-penalty 


version  of  the  Cholesky  estimate  of  Wu  and  Pourahmadi  (2003).  The  off-diagonal  terms  of  U  are 
penalized  by  their  £i-norm  and  cross-validation  is  used  to  select  a  suitable  regularization  param¬ 
eter.  While  this  method  is  very  computationally  tractable  (an  algorithm  based  on  the  homotopy 
algorithm  for  linear  regressions  is  detailed  below  in  Appendix |B.1[),  it  still  suffer  from  the  deficien- 
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cies  of  Cholesky-based  methods.  Alternatively,  Banerjee  et  al.  (20051,  Banerjee  et  al.  (2007|,  Yuan 


and  Lin  (20071,  and  Friedman  et  al.  (20081  consider  an  estimate  defined  by  the  Maximum  Like¬ 


lihood  of  the  precision  matrix  for  the  Gaussian  case  penalized  by  the  £i-norm  of  its  off-diagonal 
terms.  While  these  methods  impose  sparsity  directly  in  the  precision  matrix,  no  path-following 


algorithms  are  currently  available  for  them.  Rothman  et  al.  (2007|  analyze  the  properties  of  esti¬ 
mates  defined  in  terms  of  £i-penalization  of  the  exact  Gaussian  neg-loglikelihood  and  introduce  a 
permutation  invariant  method  based  on  the  Cholesky  decomposition  to  avoid  the  computational 
cost  of  semi-definite  programming. 

The  SPLIGE  estimates  presented  here  impose  sparsity  constraints  directly  on  the  precision 
matrix.  Moreover  the  entire  regularization  path  of  SPLIGE  estimates  can  be  computed  by  homotopy 


algorithms.  It  is  based  on  previous  work  by  Meinshausen  and  Biihlmann  (20061  for  neighborhood 


selection  in  Gaussian  graphical  models.  While  Meinshausen  and  Biihlmann  (20061  use  p  separate 
linear  regressions  to  estimate  the  neighborhood  of  one  node  at  a  time,  we  propose  merging  all  p 
linear  regressions  into  a  single  least  squares  problem  where  the  observations  associated  to  each 
regression  are  weighted  according  to  their  conditional  variances.  The  loss  function  thus  formed 


can  be  interpreted  as  a  pseudo  neg-loglikelihood  (Besag,  1974 1  in  the  Gaussian  case.  To  this 
pseudo-negloglikelihood  minimization,  we  add  symmetry  constraints  and  a  weighted  version  of  the 
.^i-penalty  on  off-diagonal  terms  to  promote  sparsity.  The  SPLIGE  estimate  can  be  interpreted 


as  an  approximate  solution  following  from  replacing  the  exact  neg-loglikelihood  in  Banerjee  et  al. 


(20071  by  a  quadratic  surrogate  (the  pseudo  neg-loglikelihood). 

The  main  advantage  of  SPLIGE  estimates  is  algorithmic:  by  use  of  a  proper  parametrization, 
the  problem  involved  in  tracing  the  SPLIGE  regularization  path  can  be  recast  as  a  linear  regression 


problem  and  thus  amenable  to  be  solved  by  a  homotopy  algorithm  as  in  Osborne  et  al.  (20001  and 


Efron  et  al.  (2004|.  To  avoid  computationally  expensive  cross-validation,  we  use  information  criteria 


to  select  a  proper  amount  of  regularization.  We  compare  the  use  of  Akaike’s  Information  criterion 


(AIG,  Akaike  19731,  a  small-sample  corrected  version  of  the  AIG  (AICc,  Hurvich  et  al.  19981 


and  the  Bayesian  Information  Griterion  (BIG,  Schwartz  19781  for  selecting  the  proper  amount  of 
regularization. 
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We  use  simulations  to  compare  SPLICE  estimates  to  the  I'l-penalized  maximum  likelihood 


estimates  (Banerjee  et  al.  2005,  2007  Yuan  and  Lin  2007  Friedman  et  al.  2008)  and  to  the 


t'l-penalized  Cholesky  approach  in  Huang  et  al.  (2006).  We  have  simulated  both  small  and  large 


sample  data  sets.  Our  simulations  include  model  structures  commonly  used  in  the  literature  (ring 
and  star  topologies,  AR  processes)  as  well  as  a  few  randomly  generated  model  structures.  SPLICE 
had  the  best  performance  in  terms  of  the  quadratic  loss  and  the  spectral  norm  of  the  precision 
matrix  deviation  (||C  —  CII2).  It  also  performed  well  in  terms  of  the  entropy  loss.  SPLICE  had  a 
remarkably  good  performance  in  terms  of  selecting  the  off-diagonal  terms  of  the  precision  matrix:  in 
the  comparison  with  Cholesky,  SPLICE  incurred  a  smaller  number  of  false  positives  to  select  a  given 
number  of  true  positives;  in  the  comparison  with  the  penalized  exact  maximum  likelihood  estimates, 
the  path  following  algorithm  allows  for  a  more  careful  exploration  of  the  space  of  alternative  models. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  presents  our  pseudo-likelihood 
surrogate  function  and  some  of  its  properties.  Section  presents  the  homotopy  algorithm  used 
to  trace  the  SPLICE  regularization  path.  Section  presents  simulation  results  comparing  the 
SPLICE  estimates  with  some  alternative  regularized  methods.  Finally,  Section  concludes  with  a 
short  discussion. 

2  An  approximate  loss  function  for  inverse  covariance  estimation 


In  this  section,  we  establish  a  parametrization  of  the  precision  matrix  S  ^  of  a  random  vector  X 
in  terms  of  the  coefficients  in  the  linear  regressions  among  its  components.  We  emphasize  that  the 


parametrization  we  use  differs  from  the  one  previously  used  by  Wu  and  Pourahmadi  (2003).  Our 


alternative  parametrization  is  used  to  extend  the  approach  used  by  Meinshausen  and  Biihlmann 


(2006)  for  the  purpose  of  estimation  of  sparse  precision  matrices.  The  resulting  loss  function  can 


be  interpreted  as  a  pseudo-likelihood  function  in  the  Gaussian  case.  For  non-Gaussian  data,  the 
minimizer  of  the  empirical  risk  function  based  on  the  loss  function  we  propose  still  yields  consistent 
estimates.  The  loss  function  we  propose  also  has  close  connections  to  linear  regression  and  lends 


itself  well  for  a  homotopy  algorithm  in  the  spirit  of  Osborne  et  al.  (2000)  and  Efron  et  al.  (2004).  A 


comparison  of  this  approximate  loss  function  to  its  exact  counterpart  in  the  Gaussian  case  suggests 


5 


that  the  approximation  is  better  the  sparser  the  precision  matrix. 

In  what  follows,  X  is  a  matrix  containing  in  each  of  its  n  rows  observations  of  the  zero- 

mean  random  vector  X  with  covariance  matrix  S.  Denote  by  Xj  the  j-th  entry  of  X  and  by  Xj* 
the  (p  —  1)  dimensional  vector  resulting  from  deleting  Xj  from  X.  For  a  given  j,  we  can  permute 
the  order  of  the  variables  in  X  and  partition  S  to  get: 


cov 


/ 

X, 

1 

= 

V 

Xj* 

j 

where  J*  corresponds  to  the  indices  in  Xj* ,  so  fijj  is  a  scalar,  £j, j*  is  a  p  —  1  dimensional  row 
vector  and  is  a  (p  —  1)  x  (p  —  1)  square  matrix.  Inverting  this  partitioned  matrix  (see,  for 

instance  Hocking  1996|  yield: 


- 

-1 

- 

1 

J 

3 

1 - 

M 

* 

Ml 

Mg'  _ 

(1) 


where: 


/3f 

dj 

Ms 

Ml 


/3j_i  . . . ,  Pjj-i,  Pjj+i,  •  •  ■  j  f3j,p 


-1) 


—Mg  ^  •  (t,j}  jtT, j*,j^  =  “^/3j, ,  (the  second  equality  due  to  symmetry). 


We  will  focus  on  the  dj  and  j3j  parameters  in  what  follows. 

The  parameters  (5j  and  dj  correspond  respectively  to  the  coefficients  and  the  expected  value  of 
the  squared  residuals  of  the  best  linear  model  of  Xj  based  on  X  j* ,  irrespectively  of  the  distribution 
of  X.  In  what  follows,  we  will  let  (3jk  denote  the  coefficient  corresponding  to  X^  in  the  linear  model 
of  Xj  based  on  Xj*.  We  define: 


•  D:  a  p  X  p  diagonal  matrix  with  dj  along  its  diagonal  and. 
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•  B:  a,  p  X  p  matrix  with  zeros  along  its  diagonal  and  off-diagonal  terms  given  by  (djk- 
Using  0  for  j  =  1, . . .  yields: 


=  D-"(Ip-B) 


(2) 


Since  S  ^  is  symmetric,  Q  implies  that  the  following  constraints  hold: 


dll3jk  =  djf3kj,  for  j,k  =  l,...,p. 


(3) 


Equation  Q  shows  that  the  sparsity  pattern  of  S  ^  can  be  inferred  from  sparsity  in  the  regres¬ 


sion  coefficients  contained  in  B.  Meinshausen  and  Biihlmann  (20061  exploit  this  fact  to  estimate 


the  neighborhood  of  a  Gaussian  graphical  model.  They  use  the  LASSO  (Tibshirani,  19961  to  obtain 
sparse  estimates  of  f3j-. 


PMj)  = 


arg  mm 


|X,-Xj. 


hjf  -h  XjWbjWi,  for  j  =  l,...,p 


(4) 


The  neighborhood  of  the  node  Xj  was  then  estimated  based  on  the  entries  of  the  Pj  that  were  set 
to  zero.  Minor  inconsistencies  could  occur  as  the  regressions  are  run  separately.  As  an  example, 
one  could  have  Pjk{Xj)  =  0  and  Pkj{Xk)  /  0,  which 


Meinshausen  and  Biihlmann 


(20061  solve  by 


defining  AND  and  OR  rules  for  defining  the  estimated  neighborhood. 


To  extend  the  framework  of  Meinshausen  and  Biihlmann  (20061  to  the  estimation  of  precision 


matrices  the  parameters  dij  must  also  be  estimated  and  the  symmetry  constraints  in  0  must  be 


enforced.  We  use  a  pseudo-likelihood  approach  (Besag,  1974)  to  form  a  surrogate  loss  function 


involving  all  terms  of  B  and  D.  For  Gaussian  X,  the  negative  log-likelihood  function  of  Xj  given 
Xj*  is: 


log 


p{X,\Xj.,dj,Pj)]  =  -f  log(27r)  -  f  log(d|)  -  4 


(5) 


The  parameters  and  R-  can  be  consistently  estimated  by  minimizing  (IM.  A  pseudo-neg 
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loglikelihood  function  can  be  formed  as: 


£(X;D,B)  =  log 

=  -f  log(27r)  -  ^logdet(D2)  -  itr  [X(Ip  -  B')D-2(Ip  -  B)X']  . 

An  advantage  of  the  surrogate  £(X;D,B)  is  that,  for  fixed  D,  it  is  a  quadratic  form  in  B.  To 
promote  sparsity  on  the  precision  matrix,  we  propose  using  a  weighted  £i-penalty  on  B: 


(d(A),B(A)) 


arg  min 
{B,D) 


S.t. 


where  ||B||^p  = 


{nlogdet(D2)  +  tr  [X(Ip  -  B')D-2(Ip  -  B)X']}  +  A||B|U,i 


= 

0 

^kk^jk 

= 

d%bkj 

(7) 

= 

0, 

for  k  ^  j 

cRa 

> 

0 

33 

From  (|^, 

the 

precision 

matrix  estimate  C(A)  is  then  given  by: 

C(A)  =  D-2(A) 


Ip 


B(A) 


(8) 


The  weights  Wjk  in  Q  can  be  adjusted  to  accommodate  differences  in  scale  between  the  bjk 
parameters.  In  the  remainder  of  this  paper,  we  fix  Wjk  =  1  for  all  j,  k  such  that  j  ^  k  (notice  that 
the  weights  Wjj  are  irrelevant  since  bjj  =  0  for  all  j). 

The  main  advantage  of  the  pseudo-likelihood  estimates  as  defined  in  0  is  algorithmic.  Fixing 
D,  the  minimizer  with  respect  to  the  B  parameter  is  the  solution  of  a  ^i-penalized  least  squares 


problem.  Hence,  for  fixed  D  it  is  possible  to  adapt  the  homotopy/LARS-LASSO  algorithm  (Os¬ 


borne  et  al.  2000;  Efron  et  al.  2004)  to  obtain  estimates  for  all  values  of  A.  For  each  fixed  B,  the 


minimizer  with  respect  to  D  has  a  closed  form  solution.  The  algorithm  presented  in  Section  is 
based  on  performing  alternate  optimization  steps  with  respect  to  B  and  D  to  take  advantage  of 
these  properties. 

One  drawback  of  the  precision  matrix  estimate  C(A)  in  ([^  is  that  it  cannot  be  ensured  to 
be  positive  semi-definite.  However,  from  the  optimality  conditions  to  0  it  can  be  proven  that 


for  a  large  enough  A,  all  terms  in  the  B(A)  matrix  are  set  to  zero.  For  such  highly  regularized 
estimates,  C'(A)  is  diagonal.  Therefore,  continuity  of  the  regularization  path  implies  existence  of 
A*  <  inf{A  :  B(A)  =  0}  for  which  C(A*)  is  diagonal  dominant  and  thus  positive  semi-definite. 

We  return  to  the  issue  of  positive  semi-definiteness  later  in  the  paper.  We  will  prove  in  Section 


2.4  that,  when  the  £i-norm  is  replaced  by  the  t'2-norm,  positive  semi-definiteness  is  ensured  for 


any  value  of  the  regularization  parameter.  That  suggests  that  a  penalty  similar  to  the  elastic  net 
(]Zou  and  Hastie  20051  can  make  the  estimates  along  the  .^i-norm  regularization  path  positive 
semi-definite  for  a  larger  stretch.  In  addition,  in  Section  we  present  evidence  that  the  £i-norm 
penalized  estimates  are  positive  semi-definite  for  most  of  the  regularization  path. 


2.1  Alternative  normalization 

Before  we  move  on,  we  present  a  normalization  of  the  X  data  matrix  that  leads  to  a  more  natural 
representation  of  the  precision  matrix  C  in  terms  of  B  and  D  while  resulting  in  more  convenient 
symmetry  constraints. 

The  symmetry  constraints  in  ^  can  be  rewritten  in  a  more  insightful  form: 

^Pjk  =  ^Pkj,  for  all  j  /  k.  (9) 

dj 

This  alternative  representation,  suggests  that  the  symmetry  constraints  can  be  more  easily  applied 
to  a  renormalized  version  of  the  data.  We  define: 


X  =  XD-^  and 

B  =  D-^BD. 


(10) 


Under  this  renormalization,  B  is  symmetric,  a  fact  that  will  be  explored  in  the  algorithmic  section 
below  to  enforce  the  symmetry  constraint  within  the  homotopy  algorithm  used  to  trace  regulariza¬ 
tion  paths  for  SPLICE  estimates. 

Another  advantage  of  this  renormalization  is  that  the  precision  matrix  estimate  can  be  written 
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as: 


C(A)  =  D-i  Ip-B(A) 


D 


(11) 


making  the  analysis  of  the  positive  semi-definiteness  of  C(A)  easier:  C(A)  is  positive  semi-definite  if 
and  only  if  Ip  —  B(A)  is  positive  semi-definite.  This  condition  is  satisfied  as  long  as  the  eigenvalues 
of  B(A)  are  smaller  than  1. 


2.2  Comparison  of  exact  and  pseudo-likelihoods  in  the  Gaussian  case 

Banerjee  et  al.  (2005),  Yuan  and  Lin[(2007),  Banerjee  et  al.|(2007)  and  Friedman  et  ah  (2008)  have 
considered  estimates  defined  as  the  minimizers  of  the  exact  likelihood  penalized  by  the  .^i-norm  of 
the  off-diagonal  terms  of  the  precision  matrix  C: 


CexactW  =  argminc  nlogdet(C) -h  tr  [XCX'] -h  A||C||i 

s.t.  C  is  symmetric  positive  semi-definite. 


(12) 


A  comparison  of  the  exact  and  pseudo-likelihood  functions  in  terms  of  the  (D,  B)  parametriza- 
tion  suggests  when  the  approximation  will  be  appropriate.  In  terms  of  (D,  B),  the  pseudo- likelihood 
function  in  (|^  is: 


£(X;  D,  B)  =  log(27r)  -  ^  log  det(D2)  -  A tr  X(Ip  -  ByX.‘ 


(13) 


In  the  same  parametrization,  the  exact  likelihood  function  is: 


Alea;act(X;  D,  B)  — 


—  ^  log(27r)  —  I  log  det  ^Ip  —  B 

-  Mogdet(D2)  -  kr 


X  ( Ip  -  B  )  X' 


(14) 


A  comparison  between  ( 13 )  and  ( 14 )  reveals  two  differences  in  the  exact  and  pseudo  neg-loglikelihood 
functions.  First,  the  exact  expression  involves  one  additional  log  det  ^Ip  —  B^  term  not  appearing 
in  the  surrogate  expression.  Secondly,  in  the  squared  deviation  term,  the  surrogate  expression  has 
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squared  in  the  comparison  to  the  exact  likelihood.  For  B  =  0, 
the  logdet  vanishes  and  the  weighting  term  equals  identity  and  thus  idempotent.  Since  these  two 
functions  are  continuous  in  B,  we  can  expect  this  approximation  to  work  better  the  smaller  the 
off-diagonal  terms  in  B.  In  particular,  in  the  completely  sparse  case,  the  two  functions  coincide 
and  the  approximation  is  exact. 


the  weighting  matrix 


B 


2.3  Properties  of  the  pseudo-likelihood  estimate 


In  the  classical  setting  where  p  is  fixed  and  n  grows  to  infinity,  the  unregularized  pseudo-likelihood 
estimate  C(0)  is  clearly  consistent.  The  unconstrained  estimates  for  Pj  and  are  all  consistent. 
In  adition,  the  symmetry  constraints  we  impose  are  observed  in  the  population  version  and  thus 
introduce  no  asymptotic  bias  in  the  (symmetry)  constrained  estimates.  That  in  conjunction  with 
the  results  from  Knight  and  Fu  (|2000 1  can  be  used  to  prove  that,  as  long  as  A  =  Op{n)  the  £1- 
penalized  estimates  in  0  are  consistent. 


Still  in  the  classical  setting.  Yuan  and  Lin  (20071  point  out  that  the  unpenalized  estimate 


C(0)  is  not  efficient  as  it  does  not  coincide  with  the  maximum  likelihood  estimate.  However,  the 


comparison  of  the  exact  and  pseudo- likelihoods  presented  in  Section  2.2  suggests  that  the  loss  in 
efficiency  should  be  smaller  the  sparser  the  true  precision  matrix.  In  addition,  the  penalized  pseudo- 
likelihood  estimate  lends  itself  better  for  path  following  algorithms  and  can  be  used  to  select  an 
appropriate  A  while  simultaneously  providing  a  good  starting  point  for  algorithms  computing  the 
exact  solution. 

One  interesting  question  we  will  not  pursue  in  this  paper  concerns  the  quality  of  the  pseudo- 
likelihood  approximation  in  the  non-classical  setting  where  pn  is  allowed  to  grow  with  n.  In  that 
case,  the  classical  efficiency  argument  favoring  the  exact  maximum  likelihood  over  the  pseudo- 
likelihood  approximation  no  longer  holds.  To  the  best  of  our  knowledge,  it  is  an  open  question 
whether  the  penalized  exact  ML  has  advantages  over  a  pseudo-likelihood  approximation  in  the 
non-parametric  case  {pn  growing  with  n). 
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2.4  Penalization  by  ^2-norni  and  Positive  Semi-definiteness 

As  mentioned  above, the  algorithm  we  propose  in  Section  [^suffers  from  the  drawback  of  not  enforc¬ 
ing  a  positive  semi-definite  constraint.  Imposing  such  constraint  is  costly  in  computational  terms 
and  would  slow  down  our  path  following  algorithm.  As  we  will  see  in  the  experimental  Section 
below,  the  unconstrained  estimate  is  positive  semi-definite  for  the  greater  part  of  the  regularization 
path  even  in  nearly  singular  designs. 

Before  we  review  such  experimental  evidence,  however,  we  study  an  alternative  penalization 
that  does  result  in  positive  semi-definite  estimates  for  all  levels  of  regularization.  Consider  the 
penalized  pseudo-likelihood  estimate  defined  by: 


B2(A2)  =  argmin  |tr  f(Ip  -  B)X'X(Ip  -  B') 

B  I  L 


-I-  A2  •  tr 


B'B 


s.t. 


^jj  ~  0 
^kj 


(15) 


Our  next  result  establishes  that  the  estimate  defined  by  is  positive  semi-definite  along  its  entire 
path: 

Theorem  1.  Let  C{X2)  =  D2(A2)-^  (ip  -  ^2{X2))  D2(A2)-^  be  the  preeision  matrix  estimate  re¬ 


sulting  from  (15).  For  any  D2(A2)  and  X2  >  0,  the  preeision  matrix  estimate  C2(A2)  is  positive 
semi-definite. 

This  result  suggests  that  the  .^2-penalty  may  be  useful  in  inducing  positive  semi-definiteness. 
So  an  estimate  incorporating  both  £2-  and  £i-penalties. 


B£;7v(A2,  Ai)  =  argmin  tr  (Ip  -  B)X'X(Ip  -  B' 


-I-  A2  •  tr 


B'B 


+  Ai 


B'B 


W,1 


S.t. 


^jj  ~  0 
bjk  b 


(16) 


kj 


can  have  improved  performance,  over  the  £i-penalty  alone,  in  terms  of  positive  semi-definiteness. 


The  subscript  EN  is  a  reference  to  the  Elastic  Net  penalty  of  Zou  and  Hastie  (20051.  In  our 


experimental  section  below  (Section]^,  we  have  came  across  non-positive  semi-definite  precision 
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matrix  estimate  in  none  other  than  a  few  replications.  We  hence  left  a  detailed  study  of  the  Elastic 


Net  precision  matrix  estimate  defined  in  (16)  for  future  research. 


3  Algorithms 

We  now  detail  an  iterative  algorithm  for  computing  the  regularization  path  for  SPLICE  estimates. 
The  computational  advantage  of  that  approximate  loss  function  used  in  (Q  follows  from  two  facts: 

•  For  a  fixed  D,  the  optimization  problem  reduces  to  that  of  tracing  a  constrained  LASSO 
path; 

•  For  a  fixed  B,  the  optimizer  D  has  a  closed  form  solution; 

Based  on  these  two  facts,  we  propose  an  iterative  algorithm  for  obtaining  estimates  D(A)  and 
B(A).  Within  each  step,  the  path  for  D  and  B  as  a  function  of  the  regularization  parameter  A  is 
traced  and  the  choice  of  the  regularization  parameter  A  is  updated.  We  now  describe  the  steps  of 
our  SPLICE  algorithm. 

1.  Let  k  =  1  and  Dq  =  diag 

2.  Repeat  until  convergence-. 

(a)  Set  X  =  and  use  a  path-tracing  algorithm  for  the  regularized  problem: 


B(A)  =  argmin  jtr  [x(I  -  B')(I  -  B)X' 

B  L  L 
S.t.  —  ^kj 


+  A-  IIBI 


^jj  ~  d 


(b)  Compute  the  corresponding  path  for  D: 


Dfc(A)  =  diag 


Ip-B(A))X,||2 


n 


(c)  Select  a  value  for  the  regularization  parameter  A^; 
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(d)  Set  Bfc  =  B(Afc)  and  =  D(Afc); 


3.  Return  the  estimate  C  =  ^(I  —  Bfc)D^ 

A  subtle  but  important  detail  is  the  weighting  vector  w  used  when  applying  the  penalty  in  step 
2. (a).  Since  the  path  is  traced  in  terms  of  the  normalized  B  parameter  instead  of  B,  a  correction 
must  be  made  in  these  weights.  This  can  be  determined  by  noticing  that: 


IB 


p  p 

|'U),l  ^  ^  ^jk  \^jk\  ^  ^  '^jk 

j,k=l  j,k=l 


dk 


dk  r 
—  bi 


'jk 


j,k=l 


^jk 


=  IIBI 


as  long  as  Wjk  =  u!jk^-  As  mentioned  before,  we  fix  Wjk  =  1  throughout  this  paper,  so  we  set 
Wjk  =  '^jk^  in  our  experiments.  Of  course,  dj  are  unknown  so  the  current  estimate  is  plugged- in 
every  time  step  2. (a)  is  executed. 

In  the  remainder  of  this  section,  we  show  how  to  adapt  the  homotopy/LARS-LASSO  algorithm 
to  enforce  the  symmetry  constraints  d‘f,bjk  =  djbkj  along  the  regularization  path,  how  to  select  B 
and  D  from  the  path,  and  discuss  some  convergence  issues  related  to  the  algorithm  above. 


3.1  Enforcing  the  symmetry  constraints  along  the  path 

The  expression  defining  the  regularization  path  in  step  2. (a)  of  the  SPLICE  algorithm  above  can 
be  rewritten  as: 

B(A)  =  argrrnn  vec  ^X(Ip  —  B')^  vec  ^X(Ip  —  B')^ -|- A||B||^p 
s.t.  bjj  =  0 

bjk  —  bkj, 

which  is  a  quadratic  form  in  B  penalized  by  a  weighted  version  of  its  .^i-norm.  To  enforce  the 
equality  restriction  in  &  we  massage  the  data  into  an  appropriate  form  so  the  homotopy/LARS- 
LASSO  algorithm  can  be  used  in  its  original  form. 

The  optimization  problem  in  (|17[)  corresponds  to  a  constrained  version  of  a  penalized  regression 
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of  modified  response  (Y)  and  predictors  (Z)  given  by: 


Xi*  0  0 

0  X2*  0 

Z  =  0  0  X3* 

0  0  0 

Since  the  “model”  for  Y  given  Z  is  additive,  we  can  force  bjk  =  bjk  by  creating  a  modified 
design  matrix  Z  where  the  columns  corresponding  to  bjk  and  bkj  are  summed  into  a  single  column. 
More  precisely,  the  column  corresponding  to  bjk  with  j  <  k  in  the  Z  design  matrix  has  all  elements 
set  to  zero  except  for  the  rows  corresponding  to  and  Xj  in  the  Y  vector.  These  rows  must  be 
set  to  Xj  and  X^  respectively: 

X2  X3  •  •  •  Xp_i  Xp  0  •  •  •  0  •  •  •  0 

X3  0  •  •  •  0  0  X3  •  •  •  Xp  •  •  •  0 

0  Xi  •••  0  0  X2  •••  0  •••  0 

z  = 

0  0  •••  Xi  0  •••  0  0  •••  Xp 

0  0  •••  0  Xi  --  0  0  •••  Xp_i 

The  path  for  the  constrained  B(A)  can  then  be  traced  by  simply  applying  a  weighted  version 
of  the  LASSO  algorithm  to  Y  and  Z.  We  emphasize  that,  even  though  the  design  matrix  Z  has 
0{np^)  elements,  it  is  extremely  sparse.  It  can  be  proved  that  it  contains  only  0{np^)  non-zero 
terms.  It  is  thus  crucial  that  the  implementation  of  the  homotopy/LARS-LASSO  algorithm  used 
to  trace  this  path  make  use  of  the  sparsity  of  the  design  matrix. 


and 

0 
0 
0 
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3.2  Computational  complexity  of  one  step  of  the  algorithm 

The  algorithm  proposed  above  to  trace  the  regularization  path  of  B(A)  uses  the  homotopy/LARS- 
LASSO  algorithm  with  a  modified  regression  with  regressors  and  np  observations.  The 

computational  complexity  of  the  k-th  step  of  the  homotopy  algorithm  in  step  2. (a)  is  of  order 
0(a^  +  akup),  where  denotes  the  number  of  non-zero  terms  on  the  upper  triangular,  off-diagonal 
of  B(Afc)  (a  detailed  analysis  is  presented  in 


Efron  et  al. 


2004).  Clearly,  the  computational  cost 


increases  rapidly  as  more  off-diagonal  terms  are  added  to  the  precision  matrix. 

When  p  grows  with  n  keeping  a  constant  ratio  p/n,  we  find  it  plausible  that  most  selection 
criteria  will  pick  estimates  at  the  most  regularized  regions  of  the  path:  a  data  sample  can  only 
support  so  many  degrees  of  freedom.  Thus,  incorporating  early  stopping  criteria  into  the  path 
tracing  at  step  2. (a)  of  the  SPLICE  algorithm  can  greatly  reduce  the  computational  cost  of  obtaining 
the  path  even  further  without  degrading  the  statistical  performance  of  the  resulting  estimates. 

If  one  insists  in  having  the  entire  precision  matrix  path,  the  SPLICE  algorithm  is  still  polynomial 
in  p  and  n.  In  the  case  where  no  variables  are  dropped  and  the  variables  are  added  one  at  a  time,  the 
complexity  of  the  first  K  steps  of  the  path  is  given  by  0{K^  +  K^np).  Under  the  same  assumptions 
and  setting  K  =  0.5  ■p{p—  1),  the  SPLICE  algorithm  has  cost  of  order  0{p^  -\-np^)  to  compute  the 
entire  path  of  solutions  to  the  pseudo-likelihood  problem.  As  a  comparison,  an  algorithm  presented 
by 


Banerjee  et  al. 


(2005)  has  a  cost  of  order  to  compute  an  approximate  solution  for  the 


penalized  exact  likelihood  at  a  single  point  of  the  path,  where  p  represents  the  desired  level  of 
approximation. 


3.3  Selection  criteria  for  A: 

Different  criteria  can  be  used  to  pick  a  pair  B,D  from  the  regularization  path  (cf.  2(c)  in  the 
SPLICE  algorithm).  In  the  experiment  section  below,  we  show  the  results  of  using  Akaike’s  In¬ 


formation  Criterion  (AIC,  Akaike  1973),  the  Bayesian  information  Criterion  (BIC,  Schwartz 


1978)  and  a  corrected  version  of  the  AIC  criterion  (AICc,  Hurvich  et  al.  1998)  using  the  unbiased 
estimate  of  the  degrees  of  freedom  of  the  LASSO  along  the  regularization  path.  More  precisely,  we 
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set: 


A  =  argminA£exact(X,D,B(A))  +  Er(n,  df(C(A)) 


where: 


X(n,df(C(A))) 


2df(C(A)), 

I  I  df(C(A)) 

n _ 

^  df(C(A))+2  ’ 

n 

log(n)df(C(A)), 

\ 


for  AIC, 
for  AICc, 
for  BIG, 


As  our  estimate  of  the  degrees  of  freedom  along  the  path,  we  follow  Zou  et  ah  (2007|  and  use  p 


plus  the  number  of  non-zero  terms  in  the  upper  triangular  section  of  B(A),  that  is, 


df(A)  =  p-h  |(i,j)  :  *  <  j  and  Bij(A)  /  o| 


(18) 


The  p  term  in  expression  (18)  stems  from  the  degrees  of  freedom  used  for  estimating  the  means. 


3.4  Stopping  criterion 

We  now  determine  the  convergence  criterion  we  used  to  finish  the  loop  started  in  step  (2)  of  the 
SPLICE  algorithm.  To  do  that,  we  look  at  the  variation  in  the  terms  of  the  diagonal  matrix  D. 
We  stop  the  algorithm  once: 


max  <  log 
i<l<p 


V  J  J 


<  10“2, 


or  when  the  number  of  iterations  exceeds  a  fixed  number  Q,  whichever  occurs  first. 

We  have  also  observed  that  letting  A^+i  differ  from  Afc  often  resulted  in  oscillating  estimates 
caused  solely  by  small  variations  in  A  from  one  step  to  the  next.  To  avoid  this  issue,  we  fixed  the 
regularization  parameter  A  after  a  number  M  <  Q  of  “warm-up”  rounds. 

We  have  set  M  =  6  and  Q  =  100  for  the  simulations  in  Section]^  For  most  cases,  the  selected 
value  of  A  and  the  D  matrix  had  become  stable  and  the  algorithm  had  converged  before  either  the 
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maximum  number  M  of  “warm-up”  steps  was  reached. 


4  Numerical  Results 


In  this  section,  we  compare  the  results  obtained  by  SPLICE  in  terms  of  estimation  accuracy  and 
model  selection  performance  to  other  covariance  selection  methods,  namely  the  Cholesky  covariance 


selection  procedure  proposed  by  Huang  et  ah  (20061  (Cholesky)  and  the  £i-penalized  maximum 


likelihood  estimate  studied  previously  by  Yuan  and  Lin  (2007|;  Banerjee  et  ah  (2005 1  and  Friedman 


et  ah  (2008)  (Exact  PML).  We  compare  the  effectiveness  of  three  different  selection  criteria  -  AIC, 


AICc  and  BIC  (see  Section  3.3 1-  in  picking  an  estimate  from  the  SPLICE,  Cholesky  and  Penalized 
Exact  Log-Likelihood  regularization  paths.  We  also  compare  the  model  selection  performance  over 


the  entire  regularization  path  using  ROC  curves  (for  details,  see  4.1.2).  Finally,  we  study  the 
positive  semi-definiteness  of  the  SPLICE  estimates  along  the  regularization  path  in  a  near-singular 
design. 


4.1  Comparison  of  SPLICE  to  alternative  methods 

Figure  shows  the  graphical  models  corresponding  to  the  simulated  data  sets  we  will  be  using 
to  compare  SPLICE,  Cholesky  and  Exact  PML  in  terms  of  estimation  accuracy  and  covariance 
selection.  All  designs  involve  a  15-dimensional  zero-mean  Gaussian  random  vector  [p  =  15)  with 
precision  matrix  implied  by  the  graphical  models  shown  in  Figure  A  relatively  small  sample 
size  is  used  to  emulate  the  effect  of  high-dimensionality.  For  all  comparisons,  the  estimates  are 
computed  based  on  either  20  or  1,000  independent  samples  from  each  distribution  (small  sample 
case:  n  =  20,  large  sample  case:  n  =  1,000).  The  results  presented  here  are  based  on  r  =  200 
replications  for  each  case. 

Before  we  show  the  results,  a  few  words  on  our  choice  of  simulation  cases: 

•  The  star  model  (see  Figure  panel  a)  provides  an  interesting  example  where  the  ordering  of 
the  variables  can  have  a  great  influence  in  the  results  of  an  order-dependent  method  such  as 
Cholesky.  In  the  Direct  Star  case,  the  hub  variable  is  the  first  entry  in  the  15-dimensional 
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;ases:  In  this  tab 
case,  the  precisioi 
.enever  it  differs  fi 


random  vector  (as  shown  in  the  figure).  Conditioning  on  the  first  variable  (Xi)  makes  all 
other  variables  independent  {X2,  ■  ■  ■ ,  ^15).  Meanwhile,  in  the  inverted  star  topology,  the  hub 
variable  is  put  at  the  last  position  of  the  15-dimensional  random  vector.  As  a  result,  no 
conditional  independence  is  present  until  the  last  variable  is  added  to  the  conditioning  set. 


In  the  “AR-like”  family  of  models  (see  Figure  [^panels  b,  c  and  d)  each  15-dimensional  random 
vector  corresponds  (panels  c  and  d)  or  is  very  similar  (panel  b)  to  15  observations  from  an 
auto-regressive  process.  This  family  of  models  tends  to  give  some  advantage  to  the  Cholesky 
procedure  as  the  ordering  of  the  variables  within  the  vector  contain  some  information  about 
the  dependency  structure  among  the  variables.  The  cases  in  this  family  are  loosely  based  on 


some  of  the  simulation  designs  used  in  Yuan  and  Lin  (2007|; 


•  The  “random”  designs  (see  Figure  panels  e,  f  and  g)  were  obtained  by  randomly  choosing 
precision  matrices  as  described  in  Appendix  We  used  these  designs  to  make  sure  our  results 
are  valid  in  somewhat  less  structured  environments. 


4.1.1  Estimation  accuracy  of  SPLICE,  Cholesky  and  Exact  PML 

We  evaluate  the  accuracy  of  the  precision  matrix  estimates  according  to  following  four  metrics. 
•  The  quadratic  loss  of  the  estimated  precision  matrix,  defined  as 

A2(C,C)  =  tr(cC-^-Ip)^ 


•  The  entropy  loss  at  the  estimated  precision  matrix,  defined  as 

Ae(C,C)  =  tr  (^CC-^)  -  log  (CC-I)  -  n. 


•  The  spectral  norm  of  the  deviation  of  the  estimated  precision  matrix 

spectral  norm  of  a  square  matrix  A  is  defined  as  || A||2  =  sup^,  . 

•  The  spectral  norm  of  the  deviation  of  the  estimated  covariance  matrix 


S-S  . 


where  the 
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For  each  of  Cholesky,  SPLICE  and  Exact  PML,  we  compute  estimates  taken  from  their  paths 


using  the  selection  criteria  mentioned  in  3.3  AIC,  AICc  and  BIC.  The  Cholesky  and  SPLICE 
estimates  are  chosen  from  the  breakpoints  of  their  respective  regularization  paths.  The  path  trac¬ 
ing  algorithm  we  have  used  for  the  Cholesky  estimate  is  sketched  in  Appendix  |B.l ,  The  path 
following  algorithm  for  SPLICE  is  the  one  described  in  Section  The  Exact  ML  estimate  is  cho¬ 
sen  by  minimizing  the  selection  criterion  over  an  equally  spaced  500-point  A-grid  between  0  and 
the  maximum  absolute  value  of  the  off-diagonal  terms  of  the  sample  covariance  matrix.  We  used 
the  implementation  of  the  £i-penalized  Exact  log-likelihood  for  Matlab  made  available  at  Prof. 
Alexandre  D’Aspremont’s  web  site  (http://www.princeton.edu/~asprenion/). 

Boxplots  of  the  different  accuracy  measures  for  each  of  the  methods  and  selection  criteria  are 
shown  in  Figures  ii  and  1^  for  the  small  sample  case  (n  =  20),  and  in  Figures  EE  and  for 
the  large  sample  case  (n  =  1,000).  For  larger  sample  sizes,  the  Cholesky  estimates  do  suffer  some 
deterioration  in  terms  of  the  entropy  and  quadratic  losses  when  an  inappropriate  ordering  of  the 
variables  is  used.  As  we  will  later  see,  an  inappropriate  ordering  can  also  degrade  the  Cholesky 
performance  in  terms  of  selecting  the  right  covariance  terms. 

A  comparison  of  the  different  methods  reveals  that  the  best  method  to  use  depends  on  whether 
the  metric  used  is  on  the  inverse  covariance  (precision  matrix)  or  covariance  matrix. 


•  With  respect  to  all  the  three  metrics  on  the  inverse  covariance  (quadratic,  entropy  and  spectral 
norm  losses),  the  best  results  are  achieved  by  SPLICE.  In  the  case  of  the  quadratic  loss, 
this  result  can  be  partially  attributed  to  the  similarity  between  the  quadratic  loss  and  the 
pseudo-likelihood  function  used  in  the  SPLICE  estimate.  In  terms  of  the  spectral  norm  on 
the  precision  matrix  (||C  —  CII2),  SPLICE  performed  particularly  well  in  larger  sample  sizes 
(n  =  1,  000).  Eor  the  quadratic  and  entropy  loss  functions,  AIC  was  the  criterion  picking  the 
best  SPLICE  estimates.  In  terms  of  the  spectral  norm  loss,  SPLICE  performs  better  when 
coupled  with  AICc  for  small  sample  sizes  (n  =  20)  and  when  coupled  with  BIC  in  larger 
sample  sizes  (re  =  1,000). 

•  In  terms  of  the  spectral  norm  of  the  covariance  estimate  deviation  (||S  —  SII2),  the  best 
performance  was  achieved  by  Exact  PML.  The  performance  of  Exact  PML  was  somewhat 
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Performance  Metric 

Recommended  procedure 

A2(C,C) 

SPLICE  +  AIC 

Ae(C,C) 

SPLICE  +  AIC 

IIC-CII2 

SPLICE  +  AICc  (for  smaller  sample  size,  n  =  20) 
SPLICE  +  BIC  (for  larger  sample  size,  n  =  1,000) 

IIS-SII2 

Exact  penalized  ML  +  AIC 

Table  1:  Suitable  estimation  methods  for  different  covariance  structures  and  perfor¬ 
mance  metrics:  The  results  shown  here  are  a  summary  of  the  results  shown  in  Figures  [^through 
For  each  metric,  we  show  the  best  combination  of  estimation  method  and  selection  criterion 
based  on  our  simulations. 


insensitive  to  the  selection  criterion  used  in  many  cases:  this  may  be  caused  by  the  uniform 
grid  on  A  missing  regions  where  the  penalized  models  rapidly  change  as  A  varies.  Based  on 
the  cases  where  the  selection  criterion  affected  the  performance  of  Exact  PML,  BIG  seems  to 
yield  the  best  results  in  terms  of  ||S  —  S||2. 

For  ease  of  reference,  these  results  are  collected  in  Table 

4.1.2  Model  Selection  performance  of  SPLICE 

To  evaluate  the  model  selection  performance  of  the  different  covariance  selection  methods,  we 
compare  their  Relative  Operating  Characteristic  (ROC)  curves  defined  as  a  curve  containing  in 
its  horizontal  axis  the  minimal  number  of  false  positives  that  is  incurred  on  average  for  a  given 
number  of  true  number  of  positives,  shown  on  the  vertical  axis,  to  be  identified.  The  ROC  curve  for 
a  method  shows  its  model  selection  performance  over  all  possible  choices  of  the  tuning  parameter 
A.  We  have  chosen  to  compare  ROC  curves  instead  of  the  results  for  particular  selection  criterion 
as  different  applications  may  penalize  false  positives  and  negatives  differently. 
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Figure  2:  Accuracy  metrics  for  precision  matrix  estimates  in  the  Star  cases  for  p 


Figure  3:  Accuracy  metrics  for  precision  matrix  estimates  in  the  AR-like  cases  for 

p  =  15  and  n  =  20 
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Figure  4:  Accuracy  metrics  for  precision  matrix  estimates  in  the  randomly  generated 
designs  for  p  =  15  and  n  =  20 
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Figure  5:  Accuracy  metrics  for  precision  matrix  estimates  in  the  Star  cases  for  p 


Figure  6:  Accuracy  metrics  for  precision  matrix  estimates  in  the  AR-like  cases  for 
p  =  15  and  n  =  1,  000 
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Figure  7:  Accuracy  metrics  for  precision  matrix  estimates  in  the  randomly  generated 
designs  for  p  =  15  and  n  =  1,000 
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For  a  fixed  number  of  true  positives  on  the  vertical  axis,  we  want  the  expected  minimal  number 
of  false  positives  to  be  as  low  as  possible.  A  covariance  selection  method  is  thus  better  the  closer 
its  ROC  curve  is  to  the  upper  left  side  of  the  plot. 

Figures  and  compare  the  ROC  curves  for  the  Cholesky  and  SPLICE  covariance  selection 
procedures  for  sample  sizes  n  =  20  and  n  =  1,000  respectively.  The  Exact  ML  does  not  have 
its  ROC  curve  shown:  the  grid  used  to  approximate  its  regularization  path  often  did  not  include 
estimates  with  a  specific  number  of  true  positives.  A  finer  grid  can  ameliorate  the  problem,  but 
would  be  prohibitively  expensive  to  compute  (recall  we  used  a  grid  with  500  equally  spaced  values 
of  A).  This  illustrates  an  advantage  of  path  following  algorithms  over  using  grids:  path  following 
performs  a  more  thorough  search  on  the  space  of  models. 

The  mean  ROC  curves  on  Eigures]^ and show  that  SPLICE  had  a  better  performance  in  terms 
of  model  selection  in  comparison  to  the  Cholesky  method  over  all  cases  considered.  In  addition, 
for  a  given  number  of  true  positives,  the  number  of  false  positives  incurred  by  SPLICE  decreases 
significantly  as  the  sample  size  increases.  Our  results  also  suggest  that,  with  the  exception  of  the 
Random  Design  02,  the  chance  that  the  SPLICE  path  contains  the  true  model  approaches  one  as 
the  sample  size  increases  for  all  simulated  cases. 

Einally,  we  consider  the  effect  of  ordering  the  variables  on  the  selection  performance  of  SPLICE 
and  Cholesky.  To  do  this,  we  restrict  attention  to  the  “star”  cases  and  compare  the  performance 
of  SPLICE  and  Cholesky  when  the  variables  are  presented  in  the  “correct”  order  (Ai,  A2, . . . ,  Xp) 


and  in  the  inverted  order  {Xp,  Ap_i, . . . ,  Xi).  Figure  10  shows  the  boxplot  of  the  minimal  number 
of  false  positives  on  200  replications  of  the  Cholesky  and  SPLICE  paths  for  selected  number  of 
true  positives  in  the  small  sample  case  (n  =  20).  In  addition  to  outperforming  Cholesky  by  a  wide 
margin,  SPLICE  is  not  sensitive  to  the  order  in  which  the  variables  are  presented.  Cholesky,  on 
the  other  hand,  suffers  some  further  degradation  in  terms  of  model  selection  when  the  variables  are 
presented  in  the  reverse  order. 
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Figure  8:  “Mean  ROC  curves”  for  the  Cholesky  and  SPLICE  covariance  selection  for  p  =  15  and  n  =  20:  Within 
each  panel,  the  relative  operating  characteristic  (ROC)  curve  shows  the  mean  minimal  number  of  false  positives  (horizontal  axis) 
needed  to  achieve  a  given  number  of  true  positives  (vertical  axis)  for  both  the  SPLICE  (solid  lines)  and  Cholesky  (dashed  lines). 
A  selection  procedure  is  better  the  more  its  curve  approaches  the  upper  left  corner  of  the  plot.  Our  results  suggest  that  SPLICE 
trades  off  better  than  Cholesky  between  false  and  true  positives  across  all  cases  considered. 
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number  of  false  positives  on  the  path  for  the  indicated  number  of  true  positives.  Each  panel  can  be  thought  of  as  the  boxplot 
corresponding  to  a  horizontal  layer  on  the  graph  shown  in  Figure  SPLICE  is  insensitive  to  the  permutation  of  the  variables 
(compate  inverted  and  direct).  Cholesky  performs  worse  than  SPLICE  in  both  the  direct  and  inverted  cases:  its  performance 
deteriorates  further  if  the  variables  are  presented  in  inverted  order. 


4.2  Positive  Semi-Definiteness  along  the  regularization  path 

As  noted  in  Sectionj^above,  there  is  no  theoretical  guarantee  that  the  SPLICE  estimates  be  positive 


semi-definite.  In  the  somewhat  well-behaved  cases  studied  in  the  experiments  of  Section  4.1  (see 


Figure  Q  all  of  the  estimates  selected  by  either  AICc  and  BIC  were  positive  semi-definite  cases. 
In  only  6  out  of  the  1,600  simulated  cases,  did  AIC  choose  a  slightly  negative  SPLICE  estimate. 
This,  however,  tells  little  about  the  positive-definiteness  of  SPLICE  estimates  in  badly  behaved 
cases.  We  now  provide  some  experimental  evidence  that  the  SPLICE  estimates  can  be  positive 
semi-definite  for  most  of  the  regularization  path  even  when  the  true  covariance  matrix  is  nearly 
singular. 

The  results  reported  for  this  experiment  are  based  on  200  replications  of  SPLICE  applied 
to  a  data  matrix  X  sampled  from  a  Gaussian  distribution  with  near  singular  covariance  matrix. 
The  number  of  observations  (n  =  40)  and  the  dimension  of  the  problem  {p  =  30)  are  kept  fixed 
throughout.  To  obtain  a  variety  of  near  singular  covariance  matrices,  the  sample  covariance  S  G 
of  each  of  the  200  replications  is  sampled  from: 


Wishart(n,  S),  with  =  0.99 


l*-il 


The  covariance  matrices  sampled  from  distribution  have  expected  value  S,  which  is  itself  close  to 
singular.  We  let  the  number  of  degrees  of  freedom  of  the  Wishart  distribution  be  small  (equal  to 
the  sample  size  n  =  30)  to  make  designs  close  to  singular  more  likely  to  happen.  Once  S  is  sampled, 
the  data  matrix  X  is  then  formed  from  an  i.i.d.  sample  of  a  A^(0,  S)  distribution. 

To  align  the  results  along  the  path  of  different  replications,  we  create  an  index  A  formed  by 
dividing  a  A  on  the  path  by  the  the  maximum  A  at  that  path.  This  index  varies  over  [0, 1]  and  lower 


values  of  A  correspond  to  less  regularized  estimates.  Figure  11  shows  the  minimum  eigenvalue  of 
C(A)  versus  the  index  A  for  the  200  simulated  replications.  We  can  see  that  non-positive  estimates 
only  occur  near  the  very  end  of  the  path  (small  values  of  A)  even  in  such  an  extreme  design. 
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5  Discussion 


In  this  paper,  we  have  defined  Sparse  Pseudo-Likelihood  Inverse  Covariance  Estimates  (SPLICEs) 
as  a  £i-penalized  pseudo-likelihood  estimate  for  precision  matrices.  The  SPLICE  loss  function  ^  is 


obtained  from  extending  previous  work  by  Meinshausen  and  Biihlmann  (2006)  to  obtain  estimates 
of  precision  matrix  that  are  symmetric.  The  SPLICE  estimates  are  formed  from  estimates  of  the 
coefficients  and  variance  of  the  residuals  of  linear  regression  models. 

The  main  advantage  of  the  estimates  proposed  here  is  algorithmic.  The  regularization  path 
for  SPLICE  estimates  can  be  efficiently  computed  by  alternating  the  estimation  of  coefficients  and 
variance  of  the  residuals  of  linear  regressions.  For  fixed  estimates  of  the  variance  of  residuals,  the 
complete  path  of  the  regression  coefficients  can  be  traced  efficiently  using  an  adaptation  of  the 


homotopy/LARS-LASSO  algorithm  (Osborne  et  al.  2000  Efron  et  ah,  2004)  that  enforces  the 
symmetry  constraints  along  the  path.  Given  the  path  of  regression  coefficients,  the  variance  of  the 
residuals  can  be  estimated  by  means  of  closed  form  solutions.  An  analysis  of  the  complexity  of  the 
algorithm  suggests  that  early  stopping  can  reduce  its  computational  cost  further.  A  comparison 
of  the  pseudo-likelihood  approximation  to  the  exact  likelihood  function  provides  another  argument 
in  favor  of  early  stopping:  the  pseudo-likelihood  approximation  is  better  the  sparser  the  estimated 
model.  Thus  moving  on  to  the  lesser  sparse  stretches  of  the  regularization  path  can  be  not  only 
computationally  costly  but  also  counterproductive  to  the  quality  of  the  estimates. 

We  have  compared  SPLICE  with  .^i-penalized  covariance  estimates  based  on  Cholesky  decom¬ 


position  (Huang  et  al.  2006)  and  the  exact  likelihood  expression  (Banerjee  et  al.  2005  2007  Yuan 


and  Lin ,  2007  Friedman  et  al.  2008 )  for  a  variety  of  sparse  precision  matrix  cases  and  in  terms  of 


four  different  metrics,  namely:  quadratic  loss,  entropy  loss,  spectral  norm  of  C  —  C  and  spectral 
norm  of  S  —  S.  SPLICE  estimates  had  the  best  performance  in  all  metrics  with  the  exception 
of  the  spectral  norm  of  S  —  S.  For  this  last  metric,  the  best  results  were  achieved  by  using  the 
I'l-penalized  exact  likelihood. 

In  terms  of  selecting  the  right  terms  of  the  precision  matrix,  SPLICE  was  able  to  pick  a  given 
number  of  correct  covariance  terms  while  incurring  in  less  false  positives  than  the  £i-penalized 


Cholesky  estimates  of  Huang  et  al.  (2006)  in  various  simulated  cases.  Using  an  uniform  grid 
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for  the  exact  penalized  maximum  likelihood  provided  few  estimates  with  a  mid-range  number 
of  correctly  picked  covariance  terms.  This  reveals  that  path  following  methods  perform  a  more 
thorough  exploration  of  the  model  space  than  penalized  estimates  computed  on  (pre-determined) 
grids  of  values  for  the  regularization  parameter. 

While  SPLICE  estimates  are  not  guaranteed  to  be  positive  semi-definite  along  the  entire  regu¬ 
larization  path,  they  have  been  observed  to  be  such  for  most  of  the  path  even  in  a  almost  singular 
problem.  Over  tamer  cases,  the  estimates  selected  by  AIC,  BIC  and  AICc  were  positive  semi- 
definite  in  the  overwhelming  majority  of  cases  (1,594  out  of  1,600). 


A  Proofs 


A.l  Positive  Semi-definiteness  of  t'2-penalized  Pseudo-likelihood  estimate 

We  now  prove  Theorem  First,  we  rewrite  the  .^2-iiorm  penalty  in  a  more  convenient  form: 


p  -I-  tr 


B'B 


p  p 


p  p 


=p+Yl  -  ^^4)'  +  E  =  tr  [(Ip  -  B)'(ip  -  B) 

j=l  k=l  j=l  j=l  k=l 


Hence,  the  ^2-penalized  estimate  defined  in  (15 1,  can  be  rewritten  as: 


B2(A2)  =  argmjn  tr  (Ip  -  B)  (X'X  +  A2lp)  (Ip  -  B') 


s.t. 


bjj  =  0,  for  j  =  1, . . .  ,p 

bjk  =  hj  ioi  j  =  I,.. .  ,p,k  =  j +  1,.  ..,p 


(19) 


Using  convexity.  The  Karush-Kuhn- Tucker  (KKT)  conditions  are  necessary  and  sufficient  to 
characterize  a  solution  of  problem 


X'X  +  A2lp )  (Ip  -  B2(A2))  +  0  +  P  =  0, 


(20) 


where  0  is  a  diagonal  matrix  and  P  is  an  anti-symmetric  matrix.  Given  that  bjj  =  0  and  ujjj  =  0 


36 


(anti-symmetry),  it  follows  that,  for  A2  >  0: 


=  -(x'X.  +  As)  <0, 


(21) 


that  is,  —0  is  a  positive  definite  matrix. 

From  (20),  we  can  conclude  that  (Ip  —  B2 


satisfies: 


(x'x  +  A2lp)  (ip  -  i2(A2))  +  (ip  -  i2(A2))'  (x'X  +  A2lp)  =  -20. 

Theorem  then  follows  from  setting  U  =  (Ip  —  B2(A2)),  V  =  (X'X  -|-  A2lp)  and  W  =  —0  and 
applying  the  following  lemma: 


Lemma  1.  Let  U,  V  and  W  be  p  x  p  symmetrie  matriees.  Suppose  that  V  is  strietly  positive 
definite  and  W  is  positive  semi- definite  and: 


UV  -h  VU  =  W. 


It  follows  that  U  is  positive  semi-definite. 

Proof.  Since  V  is  symmetric  positive  semi-definite,  we  can  write  it  as  V  =  AAA'.  Take  a  vector 
z  G  and  rewrite  it  as  z  =  X]fc=i7fcOA:  where  are  eigenvectors  of  V  (no  need  for  uniqueness). 
From  positive  semi-definiteness  of  W  and  the  assumed  identity: 


0  <  z'Wz  =  z'UVz  +  z'VUz 

p  p 

=  (a'fcUV  aj  -h  a'^YUaj) 

j=i  k=i 
p 

i=i 

where  the  second  follows  from  the  cross  products  being  zero: 

a'-UVa/c  =  trace  (o'- UVa/c)  =  trace(afca'UV)  =  0,  whenever  j  /  k. 
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Now,  taking  in  particular  z  =  we  have,  for  every  k  =  1, ...  ,p: 


2a'fJJ\'ak  =  2Afca'^Uafc  =  a'fyVak  >  0 


We  can  conclude  that  for  every  k  having  >  0,  a'^Ua^  >  0  and  the  result  follows. 


□ 


B  Algorithms 


B.l  Appendix:  A  Path- Following  Algorithm  for  the  Cholesky  estimate 


In  this  section,  we  describe  a  path  tracing  algorithm  for  the  precision  matrix  estimate  based  on 


Cholesky  decomposition  introduced  in  Huang  et  al.  (20061.  The  algorithm  can  be  understood  as  a 


block- wise  coordinate  optimization  in  the  same  spirit  as  Friedman  et  al.  (2007). 


For  a  fixed  diagonal  matrix  =  diag  (d^, . . . ,  dp) ,  the  sparse  Cholesky  estimate  of  Huang  et  al. 


(2006)  is: 


tJ(A)  =  arg  min  XUD-2u'X'-fA||U||i, 
ueUUT 

where  UUT  denotes  the  space  of  upper  triangular  matrices  with  unit  diagonal.  This  is  equivalent 
to  solving: 


/3(A)  =  + 


It  is  not  hard  to  see  that  the  objective  function  can  be  broken  into  p  —  1  uncoupled  smalled 
components.  As  a  result,  the  optimization  problem  can  be  separated  into  p  —  1  smaller  problems, 
that  is,  /3(A)  =  ('/32(A), . . .  ,/3p(A)')  with: 


/5i(Ad|)  =  argmin||Xj -X;i=lXfc/3jfcf +  Ad2^^^il/3jfc|. 


j2 


Each  of  these  p—1  subproblems  can  have  its  path  regularization  traced  by  means  of  the  homotopy/LARS- 


LASSO  algorithm  in  Osborne  et  al.  (2000);  Efron  et  al.  (2004).  The  /3(A)  parameter  estimate  is 
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recovered  by  \^f32{Xd2),  ■  ■  ■ ,  f3p{Xdp)j .  All  is  needed  is  a  little  care  in  merging  the  p—1  paths  together 
as  the  scaling  of  the  regularization  parameter  changes  from  one  subproblem  to  the  next. 

An  alternative  way  to  understand  how  the  problem  can  be  broken  into  these  smaller  pieces 


stems  from  the  representation  of  program  in  (22)  as  a  linear  regression.  A  little  manipulation  can 


be  used  to  show  that  (22)  can  be  represented  as  a  linear  regression  of  Y  against  Z  as  below: 


Z  = 


Xi 

d!i 


Xa 


0  0 


Xi 


Y  = 


^  0  0  0  0 
0 
0 


0  0 


Xa 


0 

0 


0  0  0  0  0  0 
0  0  0  0  0  0 


Xa 

Xs 


d-l 


0 

0 


0 

0 

0 


0 

0 

0 


0 

0 

0 


x„ 


P-1 


and, 


The  separability  of  the  program  (22)  into  the  subprograms  (22)  follows  from  the  block  diagonal 


structure  of  the  matrix  Z'Z.  The  application  of  the  homotopy/LARS-LASSO  algorithm  to  each  of 
the  problems  and  the  subsequent  merging  of  the  resulting  paths  into  a  single  path  can  be  seen  as 


a  path  version  of  the  coordinate  wise  algorithms  described  in  Friedman  et  al.  (2007) 


C  Appendix:  Sampling  sparse  precision  matrices 

In  Section]^  we  use  three  randomly  selected  precision  matrices  in  the  simulation  studies  presented 
therein.  These  random  precision  matrices  are  sampled  as  follows. 

•  A  random  sample  containing  20  observations  of  X  is  sampled  from  a  zero-mean  Gaussian 
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distribution  with  precision  matrix  containing  2  along  its  main  diagonal  and  1  on  its  off- 
diagonal  entries; 

•  A  random  precision  matrix  is  formed  by  computing  G  =  {X^ X)~^; 

•  The  number  of  off-diagonal  terms  N  is  sampled  from  the  a  geometric  distribution  with  pa¬ 
rameter  A  =  0.05  conditioned  to  be  between  1  and 

•  A  new  matrix  H  is  formed  by  setting  all  off-diagonal  of  the  matrix  G  are  set  to  zero,  except 
for  N  randomly  selected  entries  (all  entries  are  equally  likely  to  be  picked); 

•  Since  H  may  not  be  positive  definite,  the  precision  matrix  is  formed  by  adding  H  +  I15  • 
max(0,0.02  —  (p{H))  where  ^p{H)  is  the  smallest  eigenvalue  of  H. 
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